In this notebook I'll look at two wine datasets (one for white wines, one for reds). The goals are to practice exploratory data analysis and a few different visualization techniques, and then to use machine learning models to try to predict wine quality ratings. Any feedback is appreciated!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Suppress warnings so the notebook stays easier to read down the line
import warnings
from operator import itemgetter
warnings.filterwarnings('ignore')
#Read in the two wine datasets
whiteWine = pd.read_csv('Downloads/winequality-white.csv', sep=';')
whiteWine['quality'] = pd.Categorical(whiteWine.quality)
redWine = pd.read_csv('Downloads/winequality-red.csv', sep=';')
redWine['quality'] = pd.Categorical(redWine.quality)
#Check the top of our white wine dataframe
whiteWine.head(5)
#More info about our data
whiteWine.describe()
#We'll start with a pairplot differentiated by the rated wine quality
sns.pairplot(data=whiteWine, hue='quality')
#Check the top of our red wine dataframe
redWine.head(5)
#Get some more info on our data
redWine.describe()
#We'll look at a pairplot of our red wine data now, again differentiated by wine quality
sns.pairplot(data=redWine, hue='quality')
We'll combine the two datasets to look at all of our wine together.
#Prep the data for a master dataframe made of the two original sets
whiteWineCopy = whiteWine.copy()
redWineCopy = redWine.copy()
whiteWineCopy['wine type'] = 'white'
redWineCopy['wine type'] = 'red'
#Make dataframe for all our wine and check the head
allWine = pd.concat([whiteWineCopy, redWineCopy], ignore_index=True)
allWine.head(5)
#Let's try looking at each of the columns against quality, split by wine type
features = [c for c in allWine.columns if c not in ('quality', 'wine type')]
plt.figure(figsize=(18,25))
for i, col in enumerate(features, start=1):
    plt.subplot(6, 2, i)
    sns.violinplot(x='quality', y=col, data=allWine, palette='coolwarm',
                   hue='wine type', split=True, bw=0.3)
#We can look at correlation heatmaps for each of our wines
plt.figure(figsize =(18, 6))
plt.subplot(1,2,1)
plt.title('White Wine Correlation Map')
sns.heatmap(whiteWine.corr(), annot=True, cmap='coolwarm')
plt.subplot(1,2,2)
plt.title('Red Wine Correlation Map')
sns.heatmap(redWine.corr(), annot=True, cmap='coolwarm')
plt.tight_layout()
#Get our quality counts for the next parts
print(whiteWine['quality'].value_counts())
print(redWine['quality'].value_counts())
#Bar plots of the quality distributions (counts taken from the value_counts output above)
names = list(range(0, 11))
sizeW = [0, 0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0]
sizeR = [0, 0, 0, 10, 53, 681, 638, 199, 18, 0, 0]
plt.figure(figsize =(10, 4))
#White Wine
plt.subplot(1,2,1)
plt.title('Quality distribution (White Wine)')
plt.xlabel('Quality')
plt.ylabel('Count')
sns.barplot(x=names, y=sizeW)
#Red Wine
plt.subplot(1,2,2)
sns.barplot(x=names, y=sizeR)
plt.title('Quality distribution (Red Wine)')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.tight_layout()
#Now look at the breakdown percentages of quality for each wine
plt.figure(figsize =(10, 4))
plt.subplot(1,2,1)
colors2=["#7293cb", "#e1974c", "#84ba5b", "#d35e60", "#808585", "#9067a7", "#b16857", "#ccc210"]
#White Wines
plt.title('White Wine Quality Ratings')
names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
size = [0, 0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0]
percent = [100.*(x/sum(size)) for x in size]
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(names, percent)]
graph, texts = plt.pie(size, colors=colors2)
plt.legend(graph, labels, loc='center left', bbox_to_anchor=(-0.1, 1.),
fontsize=8)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.subplot(1,2,2)
#Red Wines
plt.title('Red Wine Quality Ratings')
names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
size = [0, 0, 0, 10, 53, 681, 638, 199, 18, 0, 0]
percent = [100.*(x/sum(size)) for x in size]
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(names, percent)]
graph, texts = plt.pie(size, colors=colors2)
plt.legend(graph, labels, loc='center left', bbox_to_anchor=(-0.1, 1.),
fontsize=8)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.tight_layout(pad=0, h_pad=0, w_pad=0, rect=[0, 0, 0.85, 1])
We'll try K Nearest Neighbors, Random Forests, and Support Vector Machines on each dataset to predict wine quality.
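One thing worth flagging before we start: KNN and SVM are both sensitive to feature scale, and the wine features span very different ranges (total sulfur dioxide in the hundreds vs. density near 1). We don't scale features below, which may be one reason the distance-based models underperform. A minimal sketch (on synthetic data, not the wine dataframes) of scaling inside a Pipeline so the scaler is fit only on the training data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic features with wildly different scales
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * [1.0, 100.0, 0.01]
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# StandardScaler is fit on the training fold and then applied to the test fold
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
acc = scaled_knn.score(X_te, y_te)
```

The same pipeline pattern would drop into the train/test splits below unchanged.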
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#We start with our train test split for each of our datasets
Xw = whiteWine.drop('quality', axis=1)
yw = whiteWine['quality']
Xw_train, Xw_test, yw_train, yw_test = train_test_split(Xw, yw, test_size=0.30)
Xr = redWine.drop('quality', axis=1)
yr = redWine['quality']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.30)
#We'll start by figuring out a good n value for our KNN model
error_rate_w = []
for i in range(1,40):
    knn_w = KNeighborsClassifier(n_neighbors=i)
    knn_w.fit(Xw_train, yw_train)
    pred_w_i = knn_w.predict(Xw_test)
    error_rate_w.append(np.mean(pred_w_i != yw_test))
#We can look at the graph to see what n values give us lower error rates
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate_w,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
#We will use the n that has the lowest error rate
knn_w = KNeighborsClassifier(n_neighbors=(min(enumerate(error_rate_w), key=itemgetter(1))[0])+1)
knn_w.fit(Xw_train,yw_train)
pred_w_knn = knn_w.predict(Xw_test)
#Here is our classification report and confusion matrix for the KNN model
print(classification_report(yw_test, pred_w_knn))
print(confusion_matrix(yw_test, pred_w_knn))
#Print accuracy scores
print('Using KNN on White Wine data:')
print('The accuracy is ' + str(accuracy_score(yw_test, pred_w_knn)))
#We'll make the model and fit it to the training data using 100 estimators
rfc_w = RandomForestClassifier(n_estimators=100)
rfc_w.fit(Xw_train, yw_train)
pred_w_rfc = rfc_w.predict(Xw_test)
#Here is our classification report and confusion matrix for the Random Forest model
print(classification_report(yw_test, pred_w_rfc))
print(confusion_matrix(yw_test, pred_w_rfc))
#Print accuracy scores
print('Using a Random Forest on White Wine data:')
print('The accuracy is ' + str(accuracy_score(yw_test, pred_w_rfc)))
#We'll make and fit our SVM model
svc_model_w = SVC()
svc_model_w.fit(Xw_train,yw_train)
pred_w_svc = svc_model_w.predict(Xw_test)
#Here is our classification report and confusion matrix for the SVM model
print(classification_report(yw_test, pred_w_svc))
print(confusion_matrix(yw_test, pred_w_svc))
#Print accuracy scores
print('Using an SVM on White Wine data:')
print('The accuracy is ' + str(accuracy_score(yw_test, pred_w_svc)))
#We'll again start by figuring out a good n value for our KNN model; this time we'll skip the visualization
error_rate_r = []
for i in range(1,40):
    knn_r = KNeighborsClassifier(n_neighbors=i)
    knn_r.fit(Xr_train, yr_train)
    pred_i_r = knn_r.predict(Xr_test)
    error_rate_r.append(np.mean(pred_i_r != yr_test))
#We will use the n that has the lowest error rate
knn_r = KNeighborsClassifier(n_neighbors=(min(enumerate(error_rate_r), key=itemgetter(1))[0])+1)
knn_r.fit(Xr_train,yr_train)
pred_r_knn = knn_r.predict(Xr_test)
#Here is our classification report and confusion matrix for the KNN model
print(classification_report(yr_test, pred_r_knn))
print(confusion_matrix(yr_test, pred_r_knn))
#Print accuracy scores
print('Using KNN on Red Wine data:')
print('The accuracy is ' + str(accuracy_score(yr_test, pred_r_knn)))
#We'll make the model and fit it to the training data using 100 estimators
rfc_r = RandomForestClassifier(n_estimators=100)
rfc_r.fit(Xr_train, yr_train)
pred_r_rfc = rfc_r.predict(Xr_test)
#Here is our classification report and confusion matrix for the Random Forest model
print(classification_report(yr_test, pred_r_rfc))
print(confusion_matrix(yr_test, pred_r_rfc))
#Print accuracy scores
print('Using a Random Forest on Red Wine data:')
print('The accuracy is ' + str(accuracy_score(yr_test, pred_r_rfc)))
#We'll make and fit our SVM model
svc_model_r = SVC()
svc_model_r.fit(Xr_train,yr_train)
pred_r_svc = svc_model_r.predict(Xr_test)
#Here is our classification report and confusion matrix for the SVM model
print(classification_report(yr_test, pred_r_svc))
print(confusion_matrix(yr_test, pred_r_svc))
#Print accuracy scores
print('Using an SVM on Red Wine data:')
print('The accuracy is ' + str(accuracy_score(yr_test, pred_r_svc)))
For both wine datasets we're seeing accuracy scores ranging from the low 40s to the mid 50s using KNN, around the mid to high 50s using our SVM, and around the mid to high 60s using Random Forest. Random Forest is proving to be the best on our data, with KNN consistently the worst. The models also appear more accurate on the red wine data than on the white wine data. Still, our accuracy scores are not as high as we might like. This may be because we are trying to classify our data into 6-7 categories. Let's see if we can segment our quality ratings to boost our accuracy.
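One way to put these accuracies in context is the majority-class baseline: the score of a model that always predicts the most common quality. Using the white wine counts printed above (quality 6 appears 2198 times out of 4898 wines), a quick check:

```python
# White wine quality counts from the value_counts output above
counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}
baseline = max(counts.values()) / sum(counts.values())
print('Majority-class baseline: %.3f' % baseline)
```

So any model scoring below roughly 45% on the white wine data is doing no better than always guessing quality 6.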
Instead of classifying quality into 6-7 categories, we can segment the ratings into a smaller number to boost our accuracy. Realistically, we don't need to know the exact quality score of each wine; it may be enough to know whether a wine is "bad", "medium", or "good". This is likely more relevant and useful to clients, consumers, and ourselves.
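One caveat about the segmentation below: `pd.cut(quality, 3)` splits the observed quality range into three equal-width bins, so the "bad"/"medium"/"good" thresholds depend on the data rather than on chosen cutoffs. Passing explicit bin edges makes the thresholds transparent; a small sketch (the cutoffs here are illustrative, not from the original notebook):

```python
import pandas as pd

quality = pd.Series([3, 4, 5, 5, 6, 6, 6, 7, 8])
# Equal-width bins over the observed range (what pd.cut(quality, 3) does)
auto = pd.cut(quality, 3, labels=['bad', 'medium', 'good'])
# Explicit edges: <=4 is bad, 5-6 is medium, >=7 is good
manual = pd.cut(quality, bins=[0, 4, 6, 10], labels=['bad', 'medium', 'good'])
```

On this toy series the two groupings happen to agree, but with explicit edges they are guaranteed to mean what the labels say.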
#Let's go from 7 classifications to 3
whiteWine3 = whiteWine.copy()
whiteWine3['quality'] = pd.cut(whiteWine3['quality'].astype(int), 3, labels=["bad", "medium", "good"])
redWine3 = redWine.copy()
redWine3['quality'] = pd.cut(redWine3['quality'].astype(int), 3, labels=["bad", "medium", "good"])
#We can check the head of one of our new dataframes to see how quality has changed
whiteWine3.head(5)
#We'll make the model and fit it to the new training data
X3w = whiteWine3.drop('quality', axis=1)
y3w = whiteWine3['quality']
X3w_train, X3w_test, y3w_train, y3w_test = train_test_split(X3w, y3w, test_size=0.30)
X3r = redWine3.drop('quality', axis=1)
y3r = redWine3['quality']
X3r_train, X3r_test, y3r_train, y3r_test = train_test_split(X3r, y3r, test_size=0.30)
#We'll start by figuring out a good n value for our KNN model
error_rate_w3 = []
for i in range(1,40):
    knn_w3 = KNeighborsClassifier(n_neighbors=i)
    knn_w3.fit(X3w_train, y3w_train)
    pred_i_w3 = knn_w3.predict(X3w_test)
    error_rate_w3.append(np.mean(pred_i_w3 != y3w_test))
#We will use the n that has the lowest error rate
knn_w3 = KNeighborsClassifier(n_neighbors=(min(enumerate(error_rate_w3), key=itemgetter(1))[0])+1)
knn_w3.fit(X3w_train,y3w_train)
pred_w3_knn = knn_w3.predict(X3w_test)
#Here is our classification report and confusion matrix for the KNN model
print(classification_report(y3w_test, pred_w3_knn))
print(confusion_matrix(y3w_test, pred_w3_knn))
#Print accuracy scores
print('Using KNN on our segmented White Wine data:')
print('The accuracy is ' + str(accuracy_score(y3w_test, pred_w3_knn)))
#We'll make the model and fit it to the training data using 100 estimators
rfc_w3 = RandomForestClassifier(n_estimators=100)
rfc_w3.fit(X3w_train, y3w_train)
pred_w3_rfc = rfc_w3.predict(X3w_test)
#Here is our classification report and confusion matrix for the Random Forest model
print(classification_report(y3w_test, pred_w3_rfc))
print(confusion_matrix(y3w_test, pred_w3_rfc))
#Print accuracy scores
print('Using a Random Forest on segmented White Wine data:')
print('The accuracy is ' + str(accuracy_score(y3w_test, pred_w3_rfc)))
#We'll make and fit our SVM model
svc_model_w3 = SVC()
svc_model_w3.fit(X3w_train,y3w_train)
pred_w3_svc = svc_model_w3.predict(X3w_test)
#Here is our classification report and confusion matrix for the SVM model
print(classification_report(y3w_test, pred_w3_svc))
print(confusion_matrix(y3w_test, pred_w3_svc))
#Print accuracy scores
print('Using an SVM on segmented White Wine data:')
print('The accuracy is ' + str(accuracy_score(y3w_test, pred_w3_svc)))
#We'll start by figuring out a good n value for our KNN model
error_rate_r3 = []
for i in range(1,40):
    knn_r3 = KNeighborsClassifier(n_neighbors=i)
    knn_r3.fit(X3r_train, y3r_train)
    pred_i_r3 = knn_r3.predict(X3r_test)
    error_rate_r3.append(np.mean(pred_i_r3 != y3r_test))
#We will use the n that has the lowest error rate
knn_r3 = KNeighborsClassifier(n_neighbors=(min(enumerate(error_rate_r3), key=itemgetter(1))[0])+1)
knn_r3.fit(X3r_train,y3r_train)
pred_r3_knn = knn_r3.predict(X3r_test)
#Here is our classification report and confusion matrix for the KNN model
print(classification_report(y3r_test, pred_r3_knn))
print(confusion_matrix(y3r_test, pred_r3_knn))
#Print accuracy scores
print('Using KNN on our segmented Red Wine data:')
print('The accuracy is ' + str(accuracy_score(y3r_test, pred_r3_knn)))
#We'll make the model and fit it to the training data using 100 estimators
rfc_r3 = RandomForestClassifier(n_estimators=100)
rfc_r3.fit(X3r_train, y3r_train)
pred_r3_rfc = rfc_r3.predict(X3r_test)
#Here is our classification report and confusion matrix for the Random Forest model
print(classification_report(y3r_test, pred_r3_rfc))
print(confusion_matrix(y3r_test, pred_r3_rfc))
#Print accuracy scores
print('Using a Random Forest on segmented Red Wine data:')
print('The accuracy is ' + str(accuracy_score(y3r_test, pred_r3_rfc)))
#We'll make and fit our SVM model
svc_model_r3 = SVC()
svc_model_r3.fit(X3r_train,y3r_train)
pred_r3_svc = svc_model_r3.predict(X3r_test)
#Here is our classification report and confusion matrix for the SVM model
print(classification_report(y3r_test, pred_r3_svc))
print(confusion_matrix(y3r_test, pred_r3_svc))
#Print accuracy scores
print('Using an SVM on segmented Red Wine data:')
print('The accuracy is ' + str(accuracy_score(y3r_test, pred_r3_svc)))
It appears that segmenting our data boosted the accuracy of all of our models. We are now seeing accuracies in the mid to high 60s using KNN on white wine data, and even the low 80s using KNN on red wine data; mid to high 80s using Random Forest on both wine datasets; and low 70s to mid 80s using SVM on both wine datasets. Again Random Forest outperforms the other models, and again the models appear more accurate on the red wine dataset.
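One more thing worth noting: since the "good" and "bad" classes are far rarer than "medium", a plain `train_test_split` can leave the test set with very few minority examples, making the scores noisy. Passing `stratify=y` keeps the class proportions the same in both splits; a small sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90/10 imbalanced labels, like a rare "good" class
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
# The 10% minority fraction is preserved in both splits
```

The same `stratify=` argument would slot into the wine splits above without other changes.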
from sklearn.model_selection import GridSearchCV
param_grid_w = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
grid_w = GridSearchCV(SVC(),param_grid_w,refit=True)
#We'll make the model and fit it to the training data
Xw_grid = whiteWine3.drop('quality', axis=1)
yw_grid = whiteWine3['quality']
Xw_grid_train, Xw_grid_test, yw_grid_train, yw_grid_test = train_test_split(Xw_grid, yw_grid, test_size=0.30)
# May take a while!
grid_w.fit(Xw_grid_train,yw_grid_train)
grid_w.best_params_
grid_w.best_estimator_
grid_w_predictions = grid_w.predict(Xw_grid_test)
#Here is our classification report and confusion matrix for the SVM model
print(classification_report(yw_grid_test, grid_w_predictions))
print(confusion_matrix(yw_grid_test, grid_w_predictions))
#Print accuracy scores
print('Using SVM and Gridsearch on segmented White Wine data:')
print('The accuracy is ' + str(accuracy_score(yw_grid_test, grid_w_predictions)))
param_grid_r = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
grid_r = GridSearchCV(SVC(),param_grid_r,refit=True)
#We'll make the model and fit it to the training data
Xr_grid = redWine3.drop('quality', axis=1)
yr_grid = redWine3['quality']
Xr_grid_train, Xr_grid_test, yr_grid_train, yr_grid_test = train_test_split(Xr_grid, yr_grid, test_size=0.30)
# May take a while!
grid_r.fit(Xr_grid_train,yr_grid_train)
grid_r.best_params_
grid_r.best_estimator_
grid_r_predictions = grid_r.predict(Xr_grid_test)
#Here is our classification report and confusion matrix for the SVM model
print(classification_report(yr_grid_test, grid_r_predictions))
print(confusion_matrix(yr_grid_test, grid_r_predictions))
#Print accuracy scores
print('Using SVM and Gridsearch on segmented Red Wine data:')
print('The accuracy is ' + str(accuracy_score(yr_grid_test, grid_r_predictions)))
We're not seeing huge differences between these accuracies and those from the plain SVM on our segmented data, though there might be a slight increase, at least for the red wine dataset. Overall we're getting scores similar to our previous results.
Now that we've built and run our models, let's look at which features were more or less important in classifying our data. We'll examine the Random Forest models on the segmented white and red wine datasets, since those achieved the highest accuracy scores.
importances_white = rfc_w3.feature_importances_
importances_red = rfc_r3.feature_importances_
std_white = np.std([tree.feature_importances_ for tree in rfc_w3.estimators_],
axis=0)
std_red = np.std([tree.feature_importances_ for tree in rfc_r3.estimators_],
axis=0)
indices_white = np.argsort(importances_white)[::-1]
indices_red = np.argsort(importances_red)[::-1]
our_features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
indices_white_name = [our_features[indices_white[k]] for k in range(11)]
indices_red_name = [our_features[indices_red[k]] for k in range(11)]
# Print the feature ranking for white wine
print("Feature ranking for white wine:")
for f in range(X3w_train.shape[1]):
    print("%d. %s (%f)" % (f + 1, indices_white_name[f], importances_white[indices_white[f]]))
# Print the feature ranking for red wine
print("Feature ranking for red wine:")
for f in range(X3r_train.shape[1]):
    print("%d. %s (%f)" % (f + 1, indices_red_name[f], importances_red[indices_red[f]]))
# Plot the feature importances of the white forest
plt.figure(figsize =(18, 6))
plt.subplot(1,2,1)
plt.title("Feature importances (White Wine)")
plt.bar(range(X3w_train.shape[1]), importances_white[indices_white],
color="#9067a7", yerr=std_white[indices_white], align="center")
plt.xticks(range(X3w_train.shape[1]), indices_white_name, rotation='vertical')
plt.xlim([-1, X3w.shape[1]])
# Plot the feature importances of the red forest
plt.subplot(1,2,2)
plt.title("Feature importances (Red Wine)")
plt.bar(range(X3r_train.shape[1]), importances_red[indices_red],
color="#9067a7", yerr=std_red[indices_red], align="center")
plt.xticks(range(X3r_train.shape[1]), indices_red_name, rotation='vertical')
plt.xlim([-1, X3r.shape[1]])
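One caveat on the plots above: impurity-based importances from a Random Forest can be inflated for features with many distinct values. scikit-learn's `permutation_importance` is a model-agnostic cross-check; a minimal sketch on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)   # only feature 0 is informative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
# Shuffling feature 0 should hurt test accuracy far more than shuffling the others
```

Running the same check on `rfc_w3` and `rfc_r3` would confirm whether alcohol and volatile acidity hold up as the top features.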
In this notebook we explored differences between white and red wines of various qualities using several EDA and visualization techniques. Then, using KNN, Random Forest, and SVM, we saw initially mediocre results, with accuracies ranging from the 30s to the 60s, when trying to predict wine quality from the other features. After these initial models, we segmented our data so that wines were assigned to 3 categories instead of the original 6-7, and the accuracy scores of all of our models increased substantially. We could have segmented into 2 categories to increase accuracy further, but at that point we would start to lose information about the true quality of each wine. We then tried a GridSearch on both segmented datasets to boost accuracy further, but the results were not drastically different from those obtained without GridSearch. Overall, Random Forest proved to have the highest accuracy of all our models, reaching scores up to the high 60s on the original data and up to the high 80s on the segmented data. Finally, we looked at the importance of each of our features in the Random Forest models, and the rankings for white and red wine are quite similar: in both cases, alcohol and volatile acidity are marked as the most important features. In the future it would be interesting to use these results to analytically create the highest quality wine possible.